Wellcome & Disclaimer

This site contains the materials for the Coding tools for Biochemistry & Molecular Biology (Herramientas de Programación para Bioquímica y Biología Molecular) course of fall 2022 in the Bachelor’s Degree in Biochemistry @UAM. This materials are the basis for GitHub-pages-based website that can be accessed here. Detailed academic information about the course contents, dates and assessment only can be found at the UAM Moodle site.

All this material is open access and it is shared under CC BY-NC license.

Plot your data in R

Quick Plotting in R

R has a number of built-in tools for diverse graph types such as histograms, scatter plots, stricharts, bar charts, boxplots, and more. Indeed, there are many functions in R to produce plots ranging from the very basic to the highly complex. We will show a few examples and use the plots as an excuse to add some new tricks for data analysis.

In Lesson 13 we will use the ggplot2 package for efficient generation of complex and cool plots. However, you will see that it’s sometimes useful to use the plotting functions in base R. These are installed by default with R and do not require any additional packages to be installed. They’re quick to type, straightforward to use in simple cases, and run very quickly. Then, if you want to do anything beyond very simple plots, though, it’s generally better to switch to ggplot2. In fact, once you already know how to use R’s base graphics, having these examples side by side will help you transition to using ggplot2 for when you want to make more sophisticated graphics.

The function plot() can be used to generate different types of plots, as detailed in the table and examples below.

Use of plot() function. Source: https://r-coder.com/plot-r/

For the first examples, we are going to review also some R functions to make up data following certain distributions, using functions like rnorm() or dnorm() for generation of normal data or normal density curves. The same can be obtained for Poisson distribution with dpois() and rpois().

#generate some sample data with normal distribution
x <- rnorm(500)
y <- x + rnorm(500)
plot(x)

hist(x)

plot(x, y)

plot(dnorm(0:100,mean=50,sd=5))

#customized Example
plot(x, y, pch = 21,
     bg = "red",   # Fill color
     col = "blue", # Border color
     cex = 3,      # Symbol size
     lwd = 3)      # Border width

Example Plot 1: coli_genomes_renamed.csv

Now let’s plot our data.

coli_genomes <- read.csv2(file = "data/coli_genomes_renamed.csv",
    strip.white = TRUE, stringsAsFactors = TRUE)
attach(coli_genomes)
# one variable
plot(Year)

# a factor
plot(Source)

# two variables
plot(Contigs, kmer)

# factor + numeric variable and saving the plot
png(file = "plot1.png")  #give it a file name
plot(Source, Contigs)  #construct the plot
dev.off()  #save
## quartz_off_screen 
##                 2

As shown in the example, in order to save a plot, we must follow three steps:

  1. Open the file indicating the format: png(), jpg(), svg(), postscript(), bmp(), win.metafile(), or pdf().

  2. Plot the data.

  3. Close the file: dev.off().

Alternatively, you can save the plot using the Plot menu or the Plots panel: Export –> Save as Image or Save as PDF.

As you may know, the R function cor() calculate the correlation coefficient between two or more vectors and cor.test() allow us to quickly perform a correlation test between two variables to know if the correlation is statistically significant. However, a quick plot can be also very useful.

Consider our example coli_genomes_renamed.csv. In that table, we have some features of a list of E. coli isolates and the basic stats of the genome sequencing. Regarding this data, do you think that the number of contigs > 1kb in the genome assemblies (contigs1kb) correlates with the total number of contigs (Contigs), the average contig length (average_contig) or the number of contigs that contain 50% of the genome (N50). Let’s check the data using simple plots with the plot() function.

# correlation plots
par(mfrow = c(3, 1))
plot(coli_genomes$contigs1kb ~ coli_genomes$Contigs, xlab = "Contigs",
    ylab = "Contigs > 1kb")
abline(lm(coli_genomes$contigs1kb ~ coli_genomes$Contigs), col = "red")
plot(coli_genomes$contigs1kb ~ coli_genomes$N50, xlab = "N50",
    ylab = "Contigs > 1kb")
abline(lm(coli_genomes$contigs1kb ~ coli_genomes$N50), col = "red")
plot(coli_genomes$contigs1kb ~ coli_genomes$average_contig, xlab = "Average contig",
    ylab = "Contigs > 1kb")
abline(lm(coli_genomes$contigs1kb ~ coli_genomes$average_contig),
    col = "red")

# now we will check with a cor.test and add some text to
# the plot
(test1 <- cor.test(coli_genomes$contigs1kb, coli_genomes$Contigs))
## 
##  Pearson's product-moment correlation
## 
## data:  coli_genomes$contigs1kb and coli_genomes$Contigs
## t = 7.651, df = 20, p-value = 2.306e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6945248 0.9420476
## sample estimates:
##      cor 
## 0.863334
(test2 <- cor.test(coli_genomes$contigs1kb, coli_genomes$N50))
## 
##  Pearson's product-moment correlation
## 
## data:  coli_genomes$contigs1kb and coli_genomes$N50
## t = -4.8281, df = 20, p-value = 0.0001021
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.8823317 -0.4517569
## sample estimates:
##        cor 
## -0.7336341
(test3 <- cor.test(coli_genomes$contigs1kb, coli_genomes$average_contig))
## 
##  Pearson's product-moment correlation
## 
## data:  coli_genomes$contigs1kb and coli_genomes$average_contig
## t = -7.8453, df = 20, p-value = 1.574e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.9444427 -0.7055990
## sample estimates:
##        cor 
## -0.8687624
str(test1)
## List of 9
##  $ statistic  : Named num 7.65
##   ..- attr(*, "names")= chr "t"
##  $ parameter  : Named int 20
##   ..- attr(*, "names")= chr "df"
##  $ p.value    : num 2.31e-07
##  $ estimate   : Named num 0.863
##   ..- attr(*, "names")= chr "cor"
##  $ null.value : Named num 0
##   ..- attr(*, "names")= chr "correlation"
##  $ alternative: chr "two.sided"
##  $ method     : chr "Pearson's product-moment correlation"
##  $ data.name  : chr "coli_genomes$contigs1kb and coli_genomes$Contigs"
##  $ conf.int   : num [1:2] 0.695 0.942
##   ..- attr(*, "conf.level")= num 0.95
##  - attr(*, "class")= chr "htest"
plot(coli_genomes$contigs1kb ~ coli_genomes$Contigs, xlab = "Contigs",
    ylab = "Contigs > 1kb")
abline(lm(coli_genomes$contigs1kb ~ coli_genomes$Contigs), col = "red")
text(200, 200, paste("Pearson r2=", round(test1$estimate, 2)))
plot(coli_genomes$contigs1kb ~ coli_genomes$N50, xlab = "N50",
    ylab = "Contigs > 1kb")
abline(lm(coli_genomes$contigs1kb ~ coli_genomes$N50), col = "red")
text(250000, 200, paste("Pearson r2=", round(test2$estimate,
    2)))
plot(coli_genomes$contigs1kb ~ coli_genomes$average_contig, xlab = "Average contig",
    ylab = "Contigs > 1kb")
abline(lm(coli_genomes$contigs1kb ~ coli_genomes$average_contig),
    col = "red")
text(40000, 200, paste("Pearson r2=", round(test3$estimate, 2)))

# oh wait! why only one-vs-one?
plot(coli_genomes[, c("VF", "Plasmids", "kmer", "Contigs", "N50",
    "longest.contig..bp.", "Assembly_length", "contigs1kb", "average_contig")],
    main = "Multiple Correlation plot")

Selecting the right plot

Data visualizations are a vital component of a data analysis, as they have the capability of summarizing large amounts of data efficiently in a graphical format and bring out new insights that can be difficult to understand from raw data. There are many chart types available, each with its own strengths and use cases. One of the trickiest parts of the analysis process is choosing the right way to represent your data using one of these visualizations. A wrong plot can lead to confusion or data misinterpretation (see some examples of bad graphs here).

From the perspective of classical data analysis, the first consideration to select the right plot is the nature of your variables. Some plots are recommended for numerical data (quantitative variables) and other for categorical data (qualitative variables). Then, you should considering the role for your data visualization, i.e, question you want to address or the message you want to transmit. That will depend on how many variables, as well as their distribution, grouping or correlation.

You can see a description of some of the more common plots, with examples here or here. Rather than going through all of them, we are going to use some examples to learn some basic plotting and see the effect of plot selection in order to understand your data and use them to answer questions.

Example Plot 2: Zebrafish_data.csv

We are loading again the table described in Lesson 12 and we will try out some plots.

# read data
ZFdata <- read.csv("data/Zebrafish_data.csv")

# some plots
barplot(ZFdata)
## Error in barplot.default(ZFdata): 'height' must be a vector or a matrix
barplot(ZFdata$pLoC)

par(mfrow = c(1, 3))  #arrange the three plots in a row 
barplot(ZFdata$pLoC, col = "red", main = "pLoc")
barplot(ZFdata$EFNA3, col = "green", main = "EFNA3")
barplot(ZFdata$NC1s, col = "blue", main = "NC1s")

par(mfrow = c(2, 3))  #arrange the three plots in a row 
stripchart(ZFdata$pLoC, main = "pLoc")
stripchart(ZFdata$pLoC, method = "stack", col = "red", main = "pLoc")
stripchart(ZFdata$pLoC, method = "overplot", col = "red", main = "pLoc")
stripchart(ZFdata$pLoC, method = "jitter", col = "red", main = "pLoc")
stripchart(ZFdata$EFNA3, method = "stack", col = "green", main = "EFNA3")
stripchart(ZFdata$NC1s, method = "stack", col = "blue", main = "N1cs")

par(mfrow = c(1, 3))  #arrange the three plots in a row 
hist(ZFdata$pLoC, col = "red", main = "pLoc")
hist(ZFdata$EFNA3, col = "green", main = "EFNA3")
hist(ZFdata$NC1s, col = "blue", main = "N1cs")

par(mfrow = c(1, 3))  #arrange the three plots in a row 
boxplot(ZFdata$pLoC, col = "red", main = "pLoc")
boxplot(ZFdata$EFNA3, col = "green", main = "EFNA3")
boxplot(ZFdata$NC1s, col = "blue", main = "N1cs")

Now try to answer some questions about your data and obtained plots:

  1. Which construct has the strongest impact on the disemination of metastatic cells?

  2. Which plot represent best the data?

  3. Does a barplot represent the same information than a boxplot?

Now, we are going to represent the same data again.

oldpar <- par()  #save original settings (optional)
par(mfrow = c(1, 3), cex.lab = 1, cex = 1, lwd = 2)  #new settings

hist(ZFdata$pLoC, col = rgb(1, 0, 0, 0.5), main = "pLoc", ylim = c(0,
    20), xlim = c(0, 120), xlab = "Number of Metastatic cells")
hist(ZFdata$EFNA3, col = rgb(0, 1, 0, 0.5), main = "EFNA3", ylim = c(0,
    20), xlim = c(0, 120), xlab = "Number of Metastatic cells")
hist(ZFdata$NC1s, col = rgb(0, 0, 1, 0.5), main = "NC1s", ylim = c(0,
    20), xlim = c(0, 120), xlab = "Number of Metastatic cells")

stripchart(ZFdata$pLoC, method = "jitter", pch = 19, col = rgb(1,
    0, 0, 0.5), vertical = TRUE, main = "pLoc", ylim = c(0, 120),
    xlab = "Number of Metastatic cells")
stripchart(ZFdata$EFNA3, method = "jitter", pch = 19, col = rgb(0,
    1, 0, 0.5), vertical = TRUE, main = "EFNA3", ylim = c(0,
    120), xlab = "Number of Metastatic cells")
stripchart(ZFdata$EFNA3, method = "jitter", pch = 19, col = rgb(0,
    1, 0, 0.5), vertical = TRUE, main = "EFNA3", ylim = c(0,
    120), xlab = "Number of Metastatic cells")

boxplot(ZFdata$pLoC, col = rgb(1, 0, 0, 0.5), ylim = c(0, 120),
    xlab = "Number of Metastatic cells", main = "pLoc")
boxplot(ZFdata$EFNA3, col = rgb(0, 1, 0, 0.5), ylim = c(0, 120),
    xlab = "Number of Metastatic cells", main = "EFNA3")
boxplot(ZFdata$NC1s, col = rgb(0, 0, 1, 0.5), ylim = c(0, 120),
    xlab = "Number of Metastatic cells", main = "N1cs")

par(oldpar)  #restore (optional)
## Warning in par(oldpar): el parámetro del gráfico "cin" no puede ser especificado
## Warning in par(oldpar): el parámetro del gráfico "cra" no puede ser especificado
## Warning in par(oldpar): el parámetro del gráfico "csi" no puede ser especificado
## Warning in par(oldpar): el parámetro del gráfico "cxy" no puede ser especificado
## Warning in par(oldpar): el parámetro del gráfico "din" no puede ser especificado
## Warning in par(oldpar): el parámetro del gráfico "page" no puede ser
## especificado
# alternatively you can reset par with: par(no.readonly =
# TRUE)

In these plots above, you can see how plots can be deeply customized, and you can select colors (by name, RGB, hex or HSV code), size and shape of the points (cex and pch), plot and axis titles (main, xlab and ylab) and some tricks to avoid points overlapping (method = "jitter" or method="stacked"). Also, we have used the function par() to arrange the plots integrated as a single plot. This is a very useful function that can be used to set or query many graphical parameters (see below). The arguments mfrow and mfcol are very useful to place graphs indicating the row and cols (or cols and row in mfcol). However, any other graph customization can be included here to avoid reapeat it during the plot.

Do you think, you can answer better the questions now?

Beyond some basic examples of plotting in R, the take-home message of this example is that the type of plot and the plot parameters (in this case the scale) can be essential for correct interpretation of the data and if they are not properly adjusted the plot can be strongly misleading.

Further, to compare three conditions, we needed to make three independent plots. However, in the table, there are actually six conditions, and it is easy to imagine a table with more conditions. How could you plot that? Here is where the generation of a datamatrix table, as shown in Lesson 12 is very important.

ZF_stacked <- stack(ZFdata)

boxplot(ZF_stacked$values ~ ZF_stacked$ind, col = c(rgb(1, 0,
    0, 0.5), rgb(0, 1, 0, 0.5), rgb(0, 0, 1, 0.5), rgb(1, 0.5,
    0, 0.5), rgb(0.5, 0.5, 0.5, 0.5), rgb(0, 1, 1, 0.5)))
colorines = c(rgb(1, 0, 0, 0.5), rgb(0, 1, 0, 0.5), rgb(0, 0,
    1, 0.5), rgb(1, 0.5, 0, 0.5), rgb(0.5, 0.5, 0.5, 0.5), rgb(0,
    1, 1, 0.5))  #define the colors for the whole analysis
boxplot(ZF_stacked$values ~ ZF_stacked$ind, col = colorines)

stripchart(ZF_stacked$values ~ ZF_stacked$ind, vertical = TRUE,
    method = "jitter", col = colorines, pch = 19, cex = 1, ylab = "Number of cells",
    xlab = "Plasmid")

# what about barplot?
means <- aggregate(values ~ ind, data = ZF_stacked, mean)  #you can use by() or tapply()
barplot(means$values, ylim = c(0, 120), col = colorines, ylab = "Number of cells",
    xlab = "Plasmid")

The stripchart and the boxplot strongly suggest that NC2s is probably the strongest transcript, but that is not shown in the barplot. These plots clearly show that barplots are intended for single values (categorical data) and can mislead your conclusions.

Plot customization

There are many options to customize your plots, including font type and size, point shape, line type… Let’s see some examples using code I borrow from https://r-coder.com/plot-r/

# point shape with 'pch'
r <- c(sapply(seq(5, 25, 5), function(i) rep(i, 5)))
t <- rep(seq(25, 5, -5), 5)

plot(r, t, pch = 1:25, cex = 3, yaxt = "n", xaxt = "n", ann = FALSE,
    xlim = c(3, 27), lwd = 1:3)
text(r - 1.5, t, 1:25)

# line type with 'lty'
M <- matrix(1:36, ncol = 6)
matplot(M, type = c("l"), lty = 1:6, col = "black", lwd = 3)

# Just to indicate the line types in the plot
j <- 0
invisible(sapply(seq(4, 40, by = 6), function(i) {
    j <<- j + 1
    text(2, i, paste("lty =", j))
}))

# plot box
par(mfrow = c(2, 3))

# plots
plot(x, y, bty = "o", main = "Default")
plot(x, y, bty = "7", main = "bty = '7'")
plot(x, y, bty = "L", main = "bty = 'L'")
plot(x, y, bty = "U", main = "bty = 'U'")
plot(x, y, bty = "C", main = "bty = 'C'")
plot(x, y, bty = "n", main = "bty = 'n'")

par(mfrow = c(1, 1))

Example Plot 3: more customization

Now let’s think again in our E. coli genomes. How would you add more layers of information to the plot, like labels of specific points.

# point labels

# Create the plot
plot(coli_genomes$contigs1kb ~ coli_genomes$Contigs, pch = 19,
    col = rgb(1, 0, 0, 0.5), xlab = "Contigs", ylab = "Contigs > 1kb")

# Plot the labels
text(coli_genomes$contigs1kb ~ coli_genomes$Contigs, labels = coli_genomes$Strain,
    cex = 0.6, pos = 4, col = "red")

# only some points
selected <- c(7, 17, 18)
plot(coli_genomes$contigs1kb ~ coli_genomes$Contigs, pch = 19,
    col = rgb(1, 0, 0, 0.5), xlab = "Contigs", ylab = "Contigs > 1kb")

# Index the elements with the vector
text(coli_genomes$contigs1kb[selected] ~ coli_genomes$Contigs[selected],
    labels = coli_genomes$Strain[selected], cex = 0.6, pos = 4,
    col = "red")

# another example
attach(USJudgeRatings)
plot(FAMI, INTG, main = "Familiarity with law vs Judicial integrity",
    xlab = "Familiarity", ylab = "Integrity", pch = 18, col = "blue")
text(FAMI, INTG, labels = row.names(USJudgeRatings), cex = 0.6,
    pos = 4, col = "red")

detach(USJudgeRatings)

You can look for more custom options because there are a lot. I also suggest looking at the identify() function, which allows the quick identification and labeling of selected points.

How would you plot several variables by groups? For this, we can use a loop or an apply family trick (see Lesson 12).

# grouping
par(mfrow = c(1, 3))
# how can you generate several plots for different variable
# by factors??

# option 1: looping
system.time(for (i in 9:11) {
    boxplot(by(coli_genomes[, i], coli_genomes$Source, summary),
        main = names(coli_genomes[i]))
})

##    user  system elapsed 
##   0.027   0.005   0.040
# option2: lapply
ploteando <- function(x) {
    boxplot(by(x, coli_genomes$Source, summary))
}
system.time(lapply(coli_genomes[, 9:11], FUN = ploteando))

##    user  system elapsed 
##   0.040   0.007   0.056

In this example lapply() returns a list of plots but (in this case with only 3 plots) you can notice that it is not faster than a loop. Finally, loops are not so slow on R, at least small loops.

References

Exercises

Under construction!

To Be Updated

Session Info

sessionInfo()
## R version 4.2.2 (2022-10-31)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur ... 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] es_ES.UTF-8/es_ES.UTF-8/es_ES.UTF-8/C/es_ES.UTF-8/es_ES.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] formatR_1.12 knitr_1.40  
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.30   R6_2.5.1        jsonlite_1.8.3  magrittr_2.0.3 
##  [5] evaluate_0.17   highr_0.9       stringi_1.7.8   cachem_1.0.6   
##  [9] rlang_1.0.6     cli_3.4.1       rstudioapi_0.14 jquerylib_0.1.4
## [13] bslib_0.4.0     rmarkdown_2.18  tools_4.2.2     stringr_1.4.1  
## [17] xfun_0.34       yaml_2.3.6      fastmap_1.1.0   compiler_4.2.2 
## [21] htmltools_0.5.3 sass_0.4.2